Link to website:¶

https://frederik-n-h-lundgren.github.io/

Contributions:¶

The workload was divided equally

Github Repository:¶

https://github.com/Frederik-N-H-Lundgren/frederik-n-h-lundgren.github.io.git

In [1]:
import pandas as pd
import gzip
import json
from tqdm import tqdm
import networkx as nx
import netwulf
import numpy as np
import ast
import statistics
from itertools import chain
import math
import community
from collections import defaultdict
import matplotlib.pyplot as plt
from nltk.tokenize import MWETokenizer
from nltk.tokenize import word_tokenize
import re
from nltk.stem import PorterStemmer
from tqdm import tqdm
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk
from collections import Counter
from nltk.util import bigrams
from scipy.stats import chi2
from wordcloud import WordCloud

Motivation¶

Our dataset¶

Our dataset is a subset of the ready-made, large-scale Amazon Reviews dataset collected in 2018 by Jianmo Ni, UCSD. The dataset is directly available for download at https://nijianmo.github.io/amazon/index.html ("Justifying recommendations using distantly-labeled reviews and fine-grained aspects", Jianmo Ni, Jiacheng Li, Julian McAuley, EMNLP 2019).

The subset we have chosen is the Pet Supplies category. It consists of two files: one with the reviews of the products in the category and one with the product metadata. For the reviews we have used the 5-core dataset, which means there are at least 5 reviews for each product.

Reasoning for choice of dataset(s)¶

The full Amazon dataset is huge, so we concentrate on just one category. Pets are cute, we love our own dog, and we buy him a lot of toys and snacks, so we have chosen the pet category.

We chose the 5-core reviews because the full set of reviews is very large and too computationally heavy to work with. The 5-core set focuses on a specific subset of reviews, those considered most informative or relevant, which saves computational resources while still providing valuable insights.

Goal for the end user’s experience?¶

To identify what factors influence the co-purchasing patterns of pet supply customers and how these factors affect their product reviews, and to find out whether price has an impact on reviews.

Basic stats¶

Choices in data cleaning and preprocessing¶

The original data contains many more attributes than those we have chosen to work with in this notebook, so we have dropped the unused columns and saved new, smaller files to work with (this notebook does not contain the original data).

The 5-core reviews were still too big to process, so we decided to only use reviews from 2017 onwards. This means we are no longer guaranteed 5 reviews for each product; instead we have a subset of the reviews from 2017-2018. We checked both dataframes for duplicates and deleted them. Each product in the metadata is checked against the reviews and deleted if it has no review, to make sure all products we work with have at least one review. We also check the other way around, that every product id in the reviews is present in the metadata.

Later on, when we work with the price, the data will be cleaned once more to remove products that don't have a price.

Short summary of the dataset stats¶

After cleaning, the metadata has 6 attributes and 31696 items: category, description, title, also_buy, price, asin (product id).

The review data has 5 attributes and 496672 reviews: asin, reviewText, overall (rating), reviewTime, unixReviewTime.

Data cleaning¶

The dataset loaded here is a subset of the downloaded data, as we only kept the columns we think we will need. This reduces the runtime each time we need to restart the notebook.
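The reduction step itself is not part of this notebook. As a rough sketch of how such files could be produced from the original gzipped JSON-lines dumps (the file names below are assumptions based on the download page's naming scheme, and the snippet relies on the pandas, gzip, json and tqdm imports from the first cell):

def reduce_dump(path, keep_cols, out_path):
    # Read the gzipped JSON-lines file record by record and keep only selected fields
    rows = []
    with gzip.open(path, 'rt') as f:
        for line in tqdm(f):
            record = json.loads(line)
            rows.append({col: record.get(col) for col in keep_cols})
    # Write the reduced data as a tab-separated file
    pd.DataFrame(rows).to_csv(out_path, sep='\t', index=False)

# reduce_dump('meta_Pet_Supplies.json.gz',
#             ['category', 'description', 'title', 'also_buy', 'price', 'asin'], 'dfmeta.txt')
# reduce_dump('Pet_Supplies_5.json.gz',
#             ['asin', 'reviewText', 'overall', 'reviewTime', 'unixReviewTime'], 'df_review.txt')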

In [2]:
dfmeta = pd.read_csv('dfmeta.txt', sep='\t')
dfmeta['also_buy'] = dfmeta['also_buy'].apply(ast.literal_eval)
dfmeta['category'] = dfmeta['category'].apply(ast.literal_eval)
dfmeta = dfmeta.drop_duplicates(subset='asin')
In [3]:
df_review = pd.read_csv('df_review.txt', sep='\t')
df_review['reviewTime'] = pd.to_datetime(df_review['reviewTime'], format='%m %d, %Y')

# Filter out reviews made before 2017
df_review = df_review[df_review['reviewTime'].dt.year >= 2017]
# Filter duplicates
df_review = df_review.drop_duplicates(subset='reviewText')
# Reset the index of the DataFrame
df_review.reset_index(drop=True, inplace=True)
# df_review now contains the DataFrame with reviews made in or after 2017
In [4]:
review_asin_values = df_review['asin'].unique()

# Filter dfmeta based on whether 'asin' values are present in df_review
dfmeta = dfmeta[dfmeta['asin'].isin(review_asin_values)]
dfmeta.reset_index(drop=True, inplace=True)
In [5]:
# Filter df_review based on whether 'asin' values are present in dfmeta
df_review = df_review[df_review['asin'].isin(dfmeta['asin'])]
In [6]:
dfmeta
Out[6]:
category description title also_buy price asin
0 [Pet Supplies, Top Selection from AmazonPets] ['Volume 1: 96 Words & Phrases! This is th... Pet Media Feathered Phonics The Easy Way To Te... [B0002FP328, B0002FP32S, B0002FP32I, B00CAMARX... $6.97 0972585419
1 [Pet Supplies, Dogs, Health Supplies] ["Our Dog Whisperer with Cesar Milan Complete ... Dog Whisperer With Cesar Millan: Season 1 [B000QXDFSA, B0018BD9DK, B002RJ8YDM, B002UJIY3... NaN 1417084871
2 [Pet Supplies, Dogs, Treats] ['"You won\'t want to miss this one from Paris... The Healthy Hound Cookbook: Over 125 Easy Reci... [1617690554, 1449409938, 1604334657, 163220674... $14.75 1440572828
3 [Pet Supplies, Dogs, Health Supplies, Hip &amp... ['Dr. Rexy hemp oil has powerful anti-inflamma... DR.REXY Hemp Oil for Dogs and Cats - 100% Orga... [] $19.90 1612231977
4 [Pet Supplies, Dogs] ['At last! A comprehensive, holistic guide for... Natural Cures for Your Dog & Cat [] $19.91 1882330919
... ... ... ... ... ... ...
31691 [Pet Supplies, Dogs, Collars, Harnesses & Leas... ['Full Grip Supply Camo E-Bungee Collar is a r... Full Grip Supply Camo E-Bungee Collar for Educ... [B01KAX8QIO, B00W18D3F8, B005CXJ2OA, B005MJ65Z... $16.99 B01HIIJ4US
31692 [Pet Supplies, Cats, Flea & Tick Control, Flea... ['Kills fleas, flea eggs, flea larvae and cont... Sergeants Pet Care Prod 03282 Cat Flea/Tick Co... [] $3.34 B01HIJGHOS
31693 [Pet Supplies, Cats, Flea & Tick Control, Flea... ['Premium Quality Flea Comb for Dogs, Cats and... #1 Pet Flea Comb For Dogs And Cats By Pet's Mu... [B00JUQVR7U] NaN B01HIPJRBM
31694 [Pet Supplies, Dogs, Health Supplies, Suppleme... ['Advita for Dogs is a blend of multiple probi... VetOne Advita Probiotic Nutritional Supplement... [B00DCV5E28, B078Y63641, B006CBD7LK, B077GHNQG... $17.37 B01HIQ9NGU
31695 [] ['Latex Dog Toy Prepacks are creative combinat... Zanies small latex dog toy with squeaker Pack ... [] $17.99 B01HIV7FC4

31696 rows × 6 columns

Tools, theory and analysis¶

We created a network with Amazon pet supply items as nodes; if a co-purchase between two items was made, this is an edge, taken from the "also_buy" attribute. Some basic graph analysis was executed, looking at the average degree, mode and other aspects. Here we also found the items (nodes) with the highest degree, because these are items that are typically bought with other items and are of interest as recommendations. After this we looked at the modularity, which gave a high number, so community detection was the obvious next analysis. Here we found that the graph consists of one main component, together with a large number of nodes with very low degree or degree 0. For the largest connected component we again did some basic analysis, and we could then carry this community structure over into an analysis of the reviews.

The reviews were converted to lowercase and tokenized. The tokens were created by excluding punctuation, URLs, mathematical symbols, and numbers. All tokens were stemmed and stopwords removed. All tokens were then compiled into one comprehensive token list to identify some of the most common words across all reviews. For each review we computed a sentiment score, which was averaged per item to see if there is a correlation between sentiment and other scores. We checked for possible bigrams to see if there is some context we might be missing that should be added. For each community we computed some of the most common words and their IDF to create a better understanding of what makes it unique and separates it from the others.

Some of the analysis we did was on the higher and lower ratings and sentiment scores, to see if there was any factor that produced better and more liked items. We could not find a correlation, as the top items appeared basically identical to the bottom items in terms of plots and ratings. Another good way to understand some of the co-purchasing patterns was to look at word clouds for the different communities. These gave us an indication of the theme of the products in each community, and of which items are bought together by consumers.

Co-purchasing Network¶

In our network, nodes represent pet supply products, and an (undirected) edge between node A and node B indicates that they have been purchased together before.

In [7]:
# Initialize an undirected graph
G = nx.Graph()

# Add nodes from 'asin' column
G.add_nodes_from(dfmeta['asin'])
In [8]:
# Explode the 'also_buy' column to create multiple rows
df_exploded = dfmeta[['asin', 'also_buy']].explode('also_buy')
In [9]:
df_edges = df_exploded.dropna()
In [10]:
# Filter df_edges based on whether values in 'also_buy' column are in dfmeta['asin'] 
# remove things outside category
df_edges = df_edges[df_edges['also_buy'].isin(dfmeta['asin'])]
In [11]:
edges = [(row['asin'], row['also_buy']) for _, row in df_edges.iterrows()]
G.add_edges_from(edges)

Here is some basic network analysis

In [12]:
num_nodes = len(G.nodes)
num_edges = len(G.edges)

print("Products:", num_nodes)
print("Products bought together (edges):", num_edges)
Products: 31696
Products bought together (edges): 205745
In [13]:
max_edges = num_nodes * (num_nodes - 1) / 2
print("maximum amount of links:", max_edges)

print("Density of network:", (num_edges/max_edges)*100)
maximum amount of links: 502302360.0
Density of network: 0.04096038887812512

We can see that the network is not dense at all: there are very few edges compared with the maximum possible number of edges.

In [14]:
print("Is the network connected?", nx.is_connected(G))
num_connected_components = nx.number_connected_components(G)
print("There is",num_connected_components, "subsets within the graph")
Is the network connected? False
There is 10457 subsets within the graph
In [15]:
isolated_nodes = [node for node, degree in G.degree() if degree == 0]
print("There is:", len(isolated_nodes),"isolated nodes/products in the network")
There is: 10316 isolated nodes/products in the network

As we could see from the dataframe, some products haven't been purchased together with other products, so the network will not be connected. By browsing through the data we could also see that a large number of products have an empty also_buy list, so the network is not dense, which aligns with our expectation.

In [16]:
degrees = [degree for node, degree in G.degree()]
avg_degree = np.mean(degrees)
median_degree = statistics.median(degrees)

degree_hist = nx.degree_histogram(G)
mode_degree = degree_hist.index(max(degree_hist))

min_degree = min(degrees)
max_degree = max(degrees)

print("Average:", avg_degree)
print("Median:", median_degree)
print("Mode:", mode_degree)
print("Minimum:", min_degree)
print("Maximum:", max_degree)
Average: 12.982395254921757
Median: 3.0
Mode: 0
Minimum: 0
Maximum: 1171

We can see from these numbers that the product most commonly bought with other items was bought together with 1171 products, and there are also a lot of products that aren't bought with other products, which is indicated by the mode being 0. The median shows that the degree distribution is very skewed, so the average alone does not tell us much.
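To make the skew visible, a log-log plot of the degree distribution can be sketched from the degrees list computed above (degree-0 products are left out, since they cannot be shown on a log axis):

degree_counts = Counter(degrees)
ks = sorted(k for k in degree_counts if k > 0)
counts = [degree_counts[k] for k in ks]

plt.scatter(ks, counts, alpha=0.6)
plt.xscale('log')
plt.yscale('log')
plt.xlabel('Degree')
plt.ylabel('Number of products')
plt.title('Degree distribution of the co-purchase network')
plt.show()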

Find the top items by the largest degree¶

In [17]:
degrees = dict(G.degree())

# Sort the nodes based on their degree in descending order
sorted_nodes = sorted(degrees, key=degrees.get, reverse=True)

# Get the top 5 nodes with the highest degree
top_5_nodes = sorted_nodes[:5]
print("Top 5 nodes with the highest degree:")
for node in top_5_nodes:
    print("Node:", node, "Degree:", degrees[node])
Top 5 nodes with the highest degree:
Node: B001HBBQKY Degree: 1171
Node: B0009X29WK Degree: 858
Node: B000255NCI Degree: 800
Node: B0002A5VK2 Degree: 714
Node: B0002563MW Degree: 696
In [18]:
for node in top_5_nodes:
    node_attributes = dfmeta.loc[dfmeta['asin'] == node].squeeze()
    print(node_attributes)
category       [Pet Supplies, Dogs, Treats, Cookies, Biscuits...
description    ['Wellness Just for Puppy Natural Dog Treats a...
title           Wellness Soft Puppy Bites Natural Grain Free ...
also_buy                                                      []
price                                                      $2.99
asin                                                  B001HBBQKY
Name: 6741, dtype: object
category       [Pet Supplies, Cats, Litter & Housebreaking, L...
description    ['A clay litter uniquely formulated combining ...
title          Dr. Elsey's Cat Ultra Premium Clumping Cat Lit...
also_buy                                                      []
price          .a-box-inner{background-color:#fff}#alohaBuyBo...
asin                                                  B0009X29WK
Name: 2950, dtype: object
category       [Pet Supplies, Fish & Aquatic Pets, Aquarium T...
description    ['Most water problems are invisible to the eye...
title                                       API Master Test Kits
also_buy                                                      []
price                                                     $14.95
asin                                                  B000255NCI
Name: 245, dtype: object
category       [Pet Supplies, Fish & Aquatic Pets, Aquari...
description                    ['Seachem Purigen 100ml', '', '']
title             Seachem Purigen for Freshwater & Saltwater
also_buy       [B00029PO6O, B00BS96U60, B00B50UPE0, B00JE5W4Y...
price                                                      $8.22
asin                                                  B0002A5VK2
Name: 628, dtype: object
category       [Pet Supplies, Fish & Aquatic Pets, Aquarium P...
description    ['The Penn Plax Airline Tubing for Aquariums i...
title          Penn Plax Airline Tubing for Aquariums –...
also_buy                                                      []
price                                                      $4.97
asin                                                  B0002563MW
Name: 294, dtype: object

We can see that 3 of the top 5 products are from Fish & Aquatic Pets (the same group), 1 is a cat product, and the most co-purchased item is a dog treat.

Components in the graph¶

As we want to analyse co-purchases, we look at the connected components in our graph, i.e. the nodes that have connections to others.

In [19]:
components = nx.connected_components(G)
component_sizes = [len(component) for component in components]
large_sorted_sizes = sorted(component_sizes, reverse=True)

# Select the sizes of the 25 largest components
largest_sizes = large_sorted_sizes[:25]
In [20]:
largest_sizes
Out[20]:
[21049, 6, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]

A lot of the products are just bought as a pair or in small groups, with no connection to the largest component.

In [21]:
components = nx.connected_components(G)
largest_component = max(components, key=len)

# Create a new graph containing only the nodes and edges of the largest component
largest_component_graph = G.subgraph(largest_component)
In [22]:
len(largest_component_graph.nodes())
Out[22]:
21049
In [23]:
len(largest_component_graph.edges())
Out[23]:
205548
In [24]:
degrees = [degree for node, degree in largest_component_graph.degree()]
avg_degree = np.mean(degrees)
median_degree = statistics.median(degrees)

degree_hist = nx.degree_histogram(largest_component_graph)
mode_degree = degree_hist.index(max(degree_hist))

min_degree = min(degrees)
max_degree = max(degrees)

print("Average:", avg_degree)
print("Median:", median_degree)
print("Mode:", mode_degree)
print("Minimum:", min_degree)
print("Maximum:", max_degree)
Average: 19.53042899900233
Median: 9
Mode: 1
Minimum: 1
Maximum: 1171

As the majority of the nodes are in the largest component, we analyse it specifically to uncover some of the co-purchasing patterns. In this subgraph the minimum degree is now 1, and the median and average are much larger since the degree-0 nodes are no longer part of it, but the distribution is still skewed.

Making communities based on the largest component¶

In [25]:
def compute_modularity(graph, partitioning):
    # Total number of edges in the graph
    L = graph.number_of_edges()
    modularity = 0

    for community in set(partitioning.values()):
        # Nodes in the current community
        nodes_in_community = [node for node, comm in partitioning.items() if comm == community]
        L_c = sum(1 for u, v in graph.edges(nodes_in_community) if partitioning[u] == partitioning[v])
        k_c = sum(graph.degree(node) for node in nodes_in_community)
        
        modularity += L_c / L - (k_c / (2 * L)) ** 2
        
    return modularity
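The helper above follows the usual definition of modularity as a sum over communities,

$$M = \sum_{c}\left[\frac{L_c}{L} - \left(\frac{k_c}{2L}\right)^2\right]$$

where L is the total number of edges, L_c the number of edges with both endpoints in community c, and k_c the sum of the degrees of the nodes in c.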
In [26]:
partition = community.best_partition(largest_component_graph)
In [27]:
print("The amount of communities is:", len(set(partition.values())), "which is community",set(partition.values()) )
The amount of communities is: 45 which is community {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44}
In [28]:
items_modularity = compute_modularity(largest_component_graph,partition)
items_modularity
Out[28]:
0.7829315576241409

As we have a really high modularity, we expect to see some clear communities, with similar items being close to each other.

In [29]:
group_counts = defaultdict(int)
for node, group_id in partition.items():
    group_counts[group_id] += 1

sorted_groups = sorted(group_counts.items(), key=lambda x: x[1], reverse=True)

# Print the count of nodes in each group
for group_id, count in sorted_groups:
    print("Group", group_id, "has", count, "nodes")
Group 2 has 4402 nodes
Group 3 has 3230 nodes
Group 5 has 2741 nodes
Group 16 has 2216 nodes
Group 8 has 1923 nodes
Group 34 has 1247 nodes
Group 0 has 1060 nodes
Group 37 has 960 nodes
Group 18 has 711 nodes
Group 15 has 611 nodes
Group 6 has 470 nodes
Group 30 has 337 nodes
Group 20 has 239 nodes
Group 27 has 168 nodes
Group 23 has 109 nodes
Group 35 has 109 nodes
Group 7 has 92 nodes
Group 19 has 80 nodes
Group 21 has 71 nodes
Group 43 has 31 nodes
Group 28 has 27 nodes
Group 24 has 21 nodes
Group 29 has 20 nodes
Group 22 has 18 nodes
Group 11 has 18 nodes
Group 12 has 15 nodes
Group 14 has 14 nodes
Group 41 has 12 nodes
Group 26 has 12 nodes
Group 44 has 11 nodes
Group 36 has 10 nodes
Group 40 has 8 nodes
Group 33 has 6 nodes
Group 10 has 6 nodes
Group 4 has 5 nodes
Group 39 has 5 nodes
Group 31 has 5 nodes
Group 42 has 4 nodes
Group 38 has 4 nodes
Group 9 has 4 nodes
Group 13 has 4 nodes
Group 17 has 4 nodes
Group 25 has 3 nodes
Group 32 has 3 nodes
Group 1 has 3 nodes
In [30]:
dfmeta['group'] = dfmeta['asin'].map(partition)
In [31]:
for node in tqdm(largest_component_graph.nodes()):
    largest_component_graph.nodes[node]['group'] = dfmeta.loc[dfmeta['asin'] == node, 'group'] .values[0]
100%|████████████████████████████████████| 21049/21049 [00:44<00:00, 470.88it/s]
In [32]:
netwulf.interactive.visualize(largest_component_graph)
plt.show()
In [33]:
assortativity_degree = nx.degree_assortativity_coefficient(largest_component_graph)
assortativity_degree
Out[33]:
-0.06030633355263227

Degree assortativity measures the tendency of nodes with similar degrees to be connected to each other in a network. A positive assortativity coefficient indicates that nodes with similar degrees are more likely to be connected, while a negative coefficient suggests that nodes with different degrees tend to be connected. A score of about -0.06 indicates a structure close to a random network, where connections are made without any preference based on node degrees, i.e. there is no significant tendency for nodes with similar degrees to be connected.
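As a small illustration (not part of the original analysis), the coefficient can be computed by hand as the Pearson correlation between the degrees found at the two ends of every edge; on a toy graph it should agree with networkx up to numerical precision:

def degree_assortativity_manual(graph):
    xs, ys = [], []
    for u, v in graph.edges():
        # Count each undirected edge in both directions so the measure is symmetric
        xs.extend([graph.degree(u), graph.degree(v)])
        ys.extend([graph.degree(v), graph.degree(u)])
    return np.corrcoef(xs, ys)[0, 1]

toy = nx.karate_club_graph()
print(degree_assortativity_manual(toy))
print(nx.degree_assortativity_coefficient(toy))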

Review analysis¶

In [34]:
df_review
Out[34]:
asin reviewText overall reviewTime unixReviewTime
0 1440572828 I was curious about making home cooked food to... 5.0 2017-04-21 1492732800
1 1440572828 Really good book 5.0 2017-03-22 1490140800
2 1440572828 Wish you had more recipes for treats 3.0 2017-02-12 1486857600
3 1440572828 Nice book. Can't wait to try these recipes 5.0 2017-02-08 1486512000
4 1612231977 I am disappointed in the quality of these. Th... 1.0 2018-03-29 1522281600
... ... ... ... ... ...
497007 B01HIQ9NGU It did no harm, but hard to see any improvemen... 4.0 2018-06-01 1527811200
497008 B01HIV7FC4 These are not rounded. I bought them for my li... 4.0 2017-11-26 1511654400
497009 B01HIV7FC4 My destroyer French Bulldog was not able to de... 5.0 2017-09-21 1505952000
497010 B01HIV7FC4 This is one of my dog's favorite toys, but all... 4.0 2017-06-16 1497571200
497011 B01HIV7FC4 Best toy we've purchased for our new puppy. Ea... 5.0 2017-05-04 1493856000

496672 rows × 5 columns

In [35]:
def remove_stopwords(tokens):
    stop_words = set(stopwords.words('english'))
    return [token for token in tokens if token not in stop_words]
In [36]:
def tokenize_and_preprocess(text):
    if isinstance(text, str):
        # Tokenize the text
        tokens = word_tokenize(text.lower())

        # Remove punctuation, URLs, mathematical symbols, and numbers
        tokens = [token for token in tokens if re.match(r'^[a-zA-Z]+$', token)]

        tokens = remove_stopwords(tokens)

        # Apply stemming
        porter = PorterStemmer()
        tokens = [porter.stem(token) for token in tokens]


        return tokens
    else:
        return []
In [37]:
df_review['review_tokens'] = df_review["reviewText"].apply(tokenize_and_preprocess)
In [38]:
comprehensive_tokens = list(chain.from_iterable(df_review['review_tokens']))
In [39]:
# Create a Counter object to count the frequency of each word
word_counts = Counter(comprehensive_tokens)

# Get the 10 most common words
most_common_words = word_counts.most_common(10)
most_common_words
Out[39]:
[('dog', 229531),
 ('love', 159879),
 ('cat', 135637),
 ('like', 111007),
 ('use', 110351),
 ('one', 108985),
 ('great', 98700),
 ('get', 87993),
 ('work', 85084),
 ('would', 68607)]

"Dog" and "cat" are among the most mentioned words, and most of the common words are positive.

Sentiment analysis¶

In [40]:
# initialize NLTK sentiment analyzer

analyzer = SentimentIntensityAnalyzer()

# create get_sentiment function

def get_sentiment(tokens):
    text = ' '.join(tokens)
    scores = analyzer.polarity_scores(text)
    sentiment = scores['compound'] 

    return sentiment


# apply get_sentiment function
df_review['sentiment'] = df_review['review_tokens'].apply(get_sentiment)

We then check whether the sentiment score correlates with the rating score.

In [41]:
# Plot overall rating against sentiment
plt.scatter(df_review['overall'], df_review['sentiment'], alpha=0.5)
plt.xlabel('Overall Rating')
plt.ylabel('Sentiment')
plt.title('Overall Rating vs Sentiment')
plt.show()

There are only 5 possible rating values, and because of the sheer size of the review dataset, almost every sentiment score occurs for each rating. So from this plot we can't tell anything. This is why we also want to average the rating per product, so we get more variation and can perhaps see a correlation.
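One way to make the relationship visible despite the five discrete rating values (sketched here as an alternative view, not something we include in the analysis) is a boxplot of the sentiment distribution per rating:

rating_values = sorted(df_review['overall'].unique())
sentiment_by_rating = [df_review.loc[df_review['overall'] == r, 'sentiment'] for r in rating_values]

plt.boxplot(sentiment_by_rating)
plt.xticks(range(1, len(rating_values) + 1), [str(int(r)) for r in rating_values])
plt.xlabel('Overall Rating')
plt.ylabel('Sentiment')
plt.title('Sentiment distribution per rating')
plt.show()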

Average sentiment and rating¶

In [42]:
grouped_reviews = df_review.groupby('asin')

# Step 2: Aggregate 'reviewText' and 'overall' into lists
aggregated_reviews = grouped_reviews.agg({'review_tokens': list, 'overall': list, 'sentiment': list}).reset_index()

# Step 3: Calculate average rating for each product
aggregated_reviews['average_rating'] = aggregated_reviews['overall'].apply(lambda x: sum(x) / len(x))
aggregated_reviews['average_sentiment'] = aggregated_reviews['sentiment'].apply(lambda x: sum(x) / len(x))


# Step 4: Merge with your other DataFrame
merged_df = pd.merge(dfmeta, aggregated_reviews[['asin', 'review_tokens', 'average_rating','average_sentiment']], on='asin', how='left')

# Define a function to flatten the list of tokens
def flatten_tokens(tokens_list):
    return list(chain.from_iterable(tokens_list))

# Apply the function to flatten the tokens in each row
merged_df['review_tokens'] = merged_df['review_tokens'].apply(flatten_tokens)
In [43]:
merged_df
Out[43]:
category description title also_buy price asin group review_tokens average_rating average_sentiment
0 [Pet Supplies, Top Selection from AmazonPets] ['Volume 1: 96 Words &amp; Phrases! This is th... Pet Media Feathered Phonics The Easy Way To Te... [B0002FP328, B0002FP32S, B0002FP32I, B00CAMARX... $6.97 0972585419 0.0 [ok, much, time, word, bird, get, bore, good] 2.000000 0.246000
1 [Pet Supplies, Dogs, Health Supplies] ["Our Dog Whisperer with Cesar Milan Complete ... Dog Whisperer With Cesar Millan: Season 1 [B000QXDFSA, B0018BD9DK, B002RJ8YDM, B002UJIY3... NaN 1417084871 NaN [thought, show, would, teach, whisper, dog, in... 5.000000 0.541600
2 [Pet Supplies, Dogs, Treats] ['"You won\'t want to miss this one from Paris... The Healthy Hound Cookbook: Over 125 Easy Reci... [1617690554, 1449409938, 1604334657, 163220674... $14.75 1440572828 34.0 [curiou, make, home, cook, food, supplement, n... 4.100000 0.543700
3 [Pet Supplies, Dogs, Health Supplies, Hip &amp... ['Dr. Rexy hemp oil has powerful anti-inflamma... DR.REXY Hemp Oil for Dogs and Cats - 100% Orga... [] $19.90 1612231977 NaN [disappoint, qualiti, significantli, deterior,... 4.734694 0.612665
4 [Pet Supplies, Dogs] ['At last! A comprehensive, holistic guide for... Natural Cures for Your Dog &amp; Cat [] $19.91 1882330919 NaN [great, inform] 5.000000 0.624900
... ... ... ... ... ... ... ... ... ... ...
31691 [Pet Supplies, Dogs, Collars, Harnesses & Leas... ['Full Grip Supply Camo E-Bungee Collar is a r... Full Grip Supply Camo E-Bungee Collar for Educ... [B01KAX8QIO, B00W18D3F8, B005CXJ2OA, B005MJ65Z... $16.99 B01HIIJ4US 2.0 [awesom, littl, gadget, put, perfectli, time, ... 4.500000 0.612425
31692 [Pet Supplies, Cats, Flea & Tick Control, Flea... ['Kills fleas, flea eggs, flea larvae and cont... Sergeants Pet Care Prod 03282 Cat Flea/Tick Co... [] $3.34 B01HIJGHOS NaN [got, new, puppi, ranch, rais, aka, flea, caus... 2.250000 0.290050
31693 [Pet Supplies, Cats, Flea & Tick Control, Flea... ['Premium Quality Flea Comb for Dogs, Cats and... #1 Pet Flea Comb For Dogs And Cats By Pet's Mu... [B00JUQVR7U] NaN B01HIPJRBM NaN [long, hair, cat, love, never, like, brush, lo... 4.500000 0.697500
31694 [Pet Supplies, Dogs, Health Supplies, Suppleme... ['Advita for Dogs is a blend of multiple probi... VetOne Advita Probiotic Nutritional Supplement... [B00DCV5E28, B078Y63641, B006CBD7LK, B077GHNQG... $17.37 B01HIQ9NGU 8.0 [good, probiot, mild, stress, coliti, dog, gre... 4.428571 0.437914
31695 [] ['Latex Dog Toy Prepacks are creative combinat... Zanies small latex dog toy with squeaker Pack ... [] $17.99 B01HIV7FC4 2.0 [round, bought, littl, dog, she, abl, pick, mo... 4.500000 0.269575

31696 rows × 10 columns

In [44]:
top_5_sentiment_item = merged_df.nlargest(5, 'average_sentiment')
bottom_5_sentiment_item = merged_df.nsmallest(5, 'average_sentiment')
In [45]:
top_5_sentiment_item
Out[45]:
category description title also_buy price asin group review_tokens average_rating average_sentiment
4287 [Pet Supplies, Cats, Litter & Housebreaking, L... ["Never touch cat litter again. The new and un... CatGenie-Self Washing, Self Flushing Cat Box [] NaN B000MKHQG4 NaN [purchas, catgeni, directli, manufactur, follo... 1.0 0.9975
11607 [Pet Supplies, Dogs, Collars, Harnesses & Leas... ["Like to run hands-free with your dog? Want t... OllyDog Mt Tam Hands-Free Dog Leash and Runnin... [] NaN B005URI7DA NaN [month, use, still, entir, convinc, greatest, ... 5.0 0.9963
18533 [Pet Supplies, Fish & Aquatic Pets, Aquariums ... ['Aqueon brings a new stage of aquatic health ... Aqueon BettaBow LED Desktop Fish Aquarium Kit [] $43.99 B00INCRQMW NaN [bought, two, amazon, sure, review, show, veri... 5.0 0.9952
14449 [Pet Supplies, Cats, Beds & Furniture, Cat Tre... ["Kitty'scape - A Whole New Way for Cats to Pl... Solvit Kittyscape Cat Tree House Extra Large C... [] $99.99 B00B19BGE8 NaN [third, kittyscap, cat, tree, hous, bought, go... 5.0 0.9943
8844 [Pet Supplies, Dogs, Health Supplies, Suppleme... ["A Natural Whole food Herbal Multi-Vitamin an... Dr. Harvey's MultiVitamin Mineral &amp; Herbal... [] $14.69 B003NTTYLQ 8.0 [email, harvey, websit, dentist, way, get, ful... 5.0 0.9938
In [46]:
top_5_sentiment_item["title"][8844]
Out[46]:
"Dr. Harvey's MultiVitamin Mineral &amp; Herbal Supplement"
In [47]:
bottom_5_sentiment_item
Out[47]:
category description title also_buy price asin group review_tokens average_rating average_sentiment
21148 [] [] Aquatic Arts 5 Live Freshwater Black Diamond S... [B00CF0A7ZQ, B00HJEYUU6, B00GWMTT0C, B005CTKE4... $23.95 B00NUGP2FE 16.0 [one, shrimp, dead, arriv, anoth, one, look, e... 3.0 -0.9727
16369 [Pet Supplies, Dogs, Treats, Bully Sticks] ["Downtown Pet Supply Curly Bully Sticks are 1... Downtown Pet Supply Best Free Range 10&quot; T... [] $69.99 B00DU23H5K 5.0 [horrif, experi, mini, goldendoodl, chew, one,... 1.0 -0.9654
13751 [Pet Supplies, Cats, Cat Doors, Steps, Nets & ... ['The Cat Mate Super Selective Chip and Disc c... Pet Mate Cat Mate Elite Chip And Disc Supersel... [] $114.95 B009GODTTK NaN [mine, two, annoy, failur, unit, lost, abil, r... 2.0 -0.9442
8292 [Pet Supplies, Birds, Cages & Accessories, Bir... ['HQ\'s Opening Dome Top Parrot cage is a perf... HQ Open Dometop Birdcage with Stand [B00BUEV9AU, B00GM49Y5U, B00LE594M0] $148.91 B002USI7YG 0.0 [lot, hesit, cage, final, satisfactori, conur,... 3.0 -0.9432
12620 [Pet Supplies, Fish & Aquatic Pets, Aquarium W... ['Excessive algae growth is the most common co... Fritz Aquatics AFA48016 Algae Clean Out for Aq... [] $15.55 B007GCE2W2 NaN [instanc, requir, repeat, use, work, use, tank... 4.0 -0.9201

From the top 5 and bottom 5 we can't really see any pattern for which type of product gets a high or low sentiment; 4 out of 5 in the top 5 don't have a group. We can also see that the sentiment does not always reflect the rating value: the first item in the top 5 has a very high sentiment, but its average rating is only 1.

In [48]:
# Plot overall rating against sentiment
plt.scatter(merged_df['average_rating'], merged_df['average_sentiment'], alpha=0.5)
plt.xlabel('Average Overall Rating')
plt.ylabel('Average Sentiment')
plt.title('Overall Rating vs Sentiment')
plt.show()

We see a better correlation in this plot compared to the last one. There is a slight positive trend, meaning that a higher rating tends to come with a higher sentiment. We still see stripes at the 5 possible rating values; this is because some products only have 1 review, so the average rating equals that single rating.

Does price have an impact?¶

We see in the dataframe that some of the items don't have a price, so to see whether price has an impact we focus solely on products with a price and drop the entries without one.

In [49]:
merged_df.dropna(subset=['price'], inplace=True)
merged_df['price'] = merged_df['price'].str.replace('$', '', regex=False)
In [50]:
# Replace non-numeric values with NaN
merged_df['price'] = pd.to_numeric(merged_df['price'], errors='coerce')

# Now, convert the column to float
merged_df['price'] = merged_df['price'].astype(float)
In [51]:
# Plot overall rating against price
plt.scatter(merged_df['average_rating'], merged_df['price'], alpha=0.5)
plt.xlabel('Average Overall Rating')
plt.ylabel('Price')
plt.yscale('log')
plt.title('Overall Rating vs Price')
plt.show()
In [52]:
# Plot overall sentiment against price
plt.scatter( merged_df['average_sentiment'],merged_df['price'], alpha=0.5)
plt.xlabel('Average Sentiment')
plt.ylabel('Price')
plt.yscale('log')
plt.title('Sentiment vs Price')
plt.show()

From these two plots we observe that the price does not correlate with the rating or the sentiment of the item.

Bigrams¶

We want to check whether we are missing some contextual understanding and semantic information, therefore we check for bigrams.
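The cells below build, for each bigram (w1, w2), an observed 2×2 contingency table: n_ii counts the bigram itself, n_io counts w1 followed by another word, n_oi counts another word followed by w2, and n_oo counts the remaining bigrams. Each table is then scored with a chi-squared style statistic,

$$\chi^2 = \sum_{i,j}\frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

where the E_ij are the expected counts under independence of the two words, and bigrams that are both frequent and have a very small p-value are kept as collocations.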

In [53]:
bigrams = list(bigrams(comprehensive_tokens))
In [54]:
def compute_contingency_tables(input_bigrams):
    # Count occurrences of each bigram and its components
    bigram_counter = Counter(input_bigrams)
    word1_counter = Counter(word1 for word1, _ in input_bigrams)
    word2_counter = Counter(word2 for _, word2 in input_bigrams)

    # Compute contingency tables for each unique bigram
    contingency_tables = {}
    for bigram in tqdm(set(input_bigrams)):
        word1, word2 = bigram
        n_ii = bigram_counter[bigram]
        n_io = word1_counter[word1] - n_ii
        n_oi = word2_counter[word2] - n_ii
        n_oo = len(input_bigrams) - n_ii - n_io - n_oi

        contingency_tables[bigram] = {
            'n_ii': n_ii,
            'n_io': n_io,
            'n_oi': n_oi,
            'n_oo': n_oo
        }

    return contingency_tables
In [55]:
contingency_tables = compute_contingency_tables(bigrams)
100%|█████████████████████████████| 2173253/2173253 [00:07<00:00, 310377.49it/s]
In [56]:
def compute_expected_contingency_tables(contingency_tables):
    expected_contingency_tables = {}
    for bigram, contingency_table in tqdm(contingency_tables.items()):
        n_ii = contingency_table['n_ii']
        n_io = contingency_table['n_io']
        n_oi = contingency_table['n_oi']
        n_oo = contingency_table['n_oo']

        R1 = n_ii + n_io
        C1 = n_ii + n_oi
        R2 = n_oi + n_oo
        C2 = n_io + n_oo
        N = R1 + C1 + R2 + C2

        expected_table = {
            '(R1 * C1) / N': (R1 * C1) / N,
            '(R1 * C2) / N': (R1 * C2) / N,
            '(R2 * C1) / N': (R2 * C1) / N,
            '(R2 * C2) / N': (R2 * C2) / N }
        
        expected_contingency_tables[bigram] = expected_table

    return expected_contingency_tables
In [57]:
expected_contingency_tables = compute_expected_contingency_tables(contingency_tables)
100%|█████████████████████████████| 2173253/2173253 [00:06<00:00, 324126.37it/s]
In [58]:
def compute_chi_squared_statistics(observed_contingency_table, expected_contingency_table):
    chi_squared_statistics = {}
    for bigram, observed_values in tqdm(observed_contingency_table.items()):
        expected_values = expected_contingency_table[bigram]
        chi_squared = []
        for i in range(len(observed_values.values())):   
            Oij = list(observed_values.values())[i]
            Eij = list(expected_values.values())[i]
            chi_squared.append(((Oij - Eij) ** 2) / Eij)
        chi_squared_statistics[bigram] = sum(chi_squared)
    return chi_squared_statistics
In [59]:
chi_squared = compute_chi_squared_statistics(contingency_tables,expected_contingency_tables)
100%|█████████████████████████████| 2173253/2173253 [00:13<00:00, 155523.50it/s]
In [60]:
def compute_p_value(chi_squared_statistics):
    p_values = {}
    for key, chi_squared_statistic in tqdm(chi_squared_statistics.items()):
        p_value = chi2.sf(chi_squared_statistic, df=1)
        p_values[key] = p_value
    return p_values
In [61]:
p_values = compute_p_value(chi_squared)
100%|██████████████████████████████| 2173253/2173253 [02:13<00:00, 16248.45it/s]
In [62]:
# Find collocations
collocations = []
for bigram, appearance in tqdm(contingency_tables.items()):
    if list(appearance.values())[0] > 50 and p_values.get(bigram, float('inf')) < 0.001:
        collocations.append((bigram, list(appearance.values())[0]))
100%|████████████████████████████| 2173253/2173253 [00:01<00:00, 1145357.45it/s]
In [63]:
len(collocations)
Out[63]:
22693
In [64]:
collocations.sort(key=lambda x: x[1], reverse=True)
collocations[:20]
Out[64]:
[(('dog', 'love'), 31576),
 (('cat', 'love'), 17892),
 (('work', 'great'), 15663),
 (('work', 'well'), 13464),
 (('well', 'made'), 8932),
 (('highli', 'recommend'), 8708),
 (('great', 'product'), 8502),
 (('dog', 'food'), 8389),
 (('litter', 'box'), 8338),
 (('dog', 'like'), 8283),
 (('year', 'old'), 7598),
 (('cat', 'like'), 6318),
 (('realli', 'like'), 6276),
 (('good', 'qualiti'), 5947),
 (('seem', 'like'), 5894),
 (('small', 'dog'), 5856),
 (('great', 'price'), 5177),
 (('would', 'recommend'), 5071),
 (('look', 'like'), 4895),
 (('last', 'long'), 4876)]

When we look at the bigrams, it doesn't look like there is much contextual or semantic information that is not already captured by the unigrams. There is no negation that changes the sentiment or anything that changes the meaning of the words. This is not all of the bigrams, but we can comfortably say that the top 20 bigrams do not include a negating first word, e.g. "not love".

Most common words and IDF¶

To get a better understanding of each group, we compute some of the most common words and their IDF for the 5 groups with the most members.
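Here each product's concatenated review tokens play the role of a document, so the IDF of a word w is

$$\mathrm{IDF}(w) = \log\frac{N}{n_w}$$

where N is the number of products and n_w the number of products whose reviews contain w. Generic words like "love" therefore get a low IDF, while group-specific words get a high one.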

In [65]:
group_counter = Counter(partition.values())

# Get the top 5 groups with largest number of items
top_5_groups = group_counter.most_common(5)
top_5_groups
Out[65]:
[(2, 4402), (3, 3230), (5, 2741), (16, 2216), (8, 1923)]
In [66]:
groups = merged_df.groupby('group')
In [67]:
item_groups = {}
# Drop rows where 'group' is NaN
valid_groups = merged_df.dropna(subset=['group'])

# Iterate over each group
for i in tqdm(list(set(valid_groups["group"]))):
    nest = set(tuple(tokens) for tokens in groups.get_group(i)["review_tokens"].tolist())
    nested_lists = list(nest)  # Convert set of tuples to a list of tuples
    # Flatten the list of tuples into a single list using list comprehension
    item_groups[i] = [item for sublist in nested_lists for item in sublist]
100%|███████████████████████████████████████████| 45/45 [00:01<00:00, 41.36it/s]
In [68]:
def calculate_idf(word_freq, dataframe):
    # Total number of documents in the DataFrame
    total_documents = len(dataframe)

    # Count the number of documents containing each word
    documents_containing_word = defaultdict(int)
    for word, _ in word_freq:
        for tokens in dataframe['review_tokens']:
            if word in tokens:
                documents_containing_word[word] += 1

    # Calculate IDF for each word
    idf_values = {}
    for word, _ in word_freq:
        if documents_containing_word[word] != 0:
            idf_values[word] = math.log(total_documents / documents_containing_word[word])
        else:
            idf_values[word] = 0  # Handle the case where a word doesn't appear in any document

    return idf_values
In [69]:
for group,_ in top_5_groups:
    print("")
    print(f"Group {group}")
    most_common_ = Counter(item_groups[group]).most_common(10)
    idf_ = calculate_idf(most_common_, merged_df)

    for word, feq in most_common_:
        idf = idf_.get(word, 0)  # Get IDF value for the word from the idf_ dictionary
        print(f"Word: {word}, feq: {feq}, IDF: {idf}")
    
Group 2
Word: dog, feq: 77482, IDF: 0.6089354893950956
Word: love, feq: 38659, IDF: 0.4576897882944777
Word: one, feq: 28581, IDF: 0.5821226163716092
Word: great, feq: 27079, IDF: 0.5069164280454364
Word: use, feq: 23761, IDF: 0.5781915896146571
Word: like, feq: 23597, IDF: 0.5195414570013981
Word: toy, feq: 23593, IDF: 1.8884657402044895
Word: get, feq: 21089, IDF: 0.6331186897731702
Word: work, feq: 17730, IDF: 0.713956792023889
Word: well, feq: 17471, IDF: 0.69614119477255

Group 3
Word: cat, feq: 75739, IDF: 1.3086691990893544
Word: love, feq: 28619, IDF: 0.4576897882944777
Word: like, feq: 21115, IDF: 0.5195414570013981
Word: one, feq: 20710, IDF: 0.5821226163716092
Word: litter, feq: 18233, IDF: 2.9433185308699508
Word: use, feq: 18149, IDF: 0.5781915896146571
Word: food, feq: 16496, IDF: 1.5515231735597126
Word: get, feq: 15395, IDF: 0.6331186897731702
Word: great, feq: 12350, IDF: 0.5069164280454364
Word: box, feq: 11937, IDF: 1.9633761958764524

Group 5
Word: dog, feq: 39088, IDF: 0.6089354893950956
Word: love, feq: 24446, IDF: 0.4576897882944777
Word: food, feq: 18655, IDF: 1.5515231735597126
Word: like, feq: 13248, IDF: 0.5195414570013981
Word: treat, feq: 13215, IDF: 1.7148261093696961
Word: one, feq: 8405, IDF: 0.5821226163716092
Word: eat, feq: 7803, IDF: 1.502397194756801
Word: cat, feq: 7354, IDF: 1.3086691990893544
Word: good, feq: 6927, IDF: 0.6504733889691939
Word: get, feq: 6650, IDF: 0.6331186897731702

Group 16
Word: tank, feq: 14028, IDF: 2.3010914407372622
Word: use, feq: 7459, IDF: 0.5781915896146571
Word: work, feq: 7449, IDF: 0.713956792023889
Word: water, feq: 7430, IDF: 1.6344108333654803
Word: fish, feq: 7338, IDF: 2.299434439529633
Word: great, feq: 6730, IDF: 0.5069164280454364
Word: filter, feq: 6565, IDF: 3.040048157328502
Word: one, feq: 5439, IDF: 0.5821226163716092
Word: like, feq: 4491, IDF: 0.5195414570013981
Word: good, feq: 4142, IDF: 0.6504733889691939

Group 8
Word: dog, feq: 25476, IDF: 0.6089354893950956
Word: use, feq: 11321, IDF: 0.5781915896146571
Word: work, feq: 10772, IDF: 0.713956792023889
Word: product, feq: 9336, IDF: 0.7319228839588409
Word: like, feq: 7161, IDF: 0.5195414570013981
Word: get, feq: 6778, IDF: 0.6331186897731702
Word: love, feq: 6456, IDF: 0.4576897882944777
Word: great, feq: 6264, IDF: 0.5069164280454364
Word: one, feq: 5743, IDF: 0.5821226163716092
Word: flea, feq: 5549, IDF: 3.7436666377557426

The animal isn't the only factor dividing the groups; factors like the use of the product matter too, as we can see in group 16, where words like tank, fish and filter are quite unique, and in group 5, where food and treat stand out. So even with high-frequency words, we get a much better understanding of the groups' items when we look at the IDF.

Top and bottom nodes by degree¶

In [70]:
degrees = dict(G.degree())
In [71]:
degrees = sorted(degrees.items(), key=lambda x: x[1], reverse=True)
In [72]:
top_1000_nodes = [node for node, degrees in degrees[:1000]]
bottom_1000_nodes = [node for node, degrees in degrees[-1000:]]

# Filter merged_df based on the top 1000 and bottom 1000 nodes
top_1000_df = merged_df[merged_df['asin'].isin(top_1000_nodes)]
bottom_1000_df = merged_df[merged_df['asin'].isin(bottom_1000_nodes)]
In [73]:
avg_average_rating = top_1000_df['average_rating'].mean()
avg_average_sentiment = top_1000_df['average_sentiment'].mean()

print("Average of average_rating in top_1000:", avg_average_rating)
print("Average of average_sentiment in top_1000:", avg_average_sentiment)


avg_average_rating = bottom_1000_df['average_rating'].mean()
avg_average_sentiment = bottom_1000_df['average_sentiment'].mean()

print("Average of average_rating in bottom_1000:", avg_average_rating)
print("Average of average_sentiment in bottom_1000:", avg_average_sentiment)
Average of average_rating in top_1000: 4.283205006623296
Average of average_sentiment in top_1000: 0.4726882623015374
Average of average_rating in bottom_1000: 4.104591659289994
Average of average_sentiment in bottom_1000: 0.5096076121207322

The node degree doesn't seem to have an effect on sentiment or rating, as the averages for the top and bottom 1000 are almost the same.

Top and bottom by price¶

In [74]:
sorted_df = merged_df.sort_values(by='price')
In [75]:
expensive_1000_item = sorted_df.tail(1000)
cheep_1000_items = sorted_df.head(1000)
In [76]:
avg_average_rating = expensive_1000_item['average_rating'].mean()
avg_average_sentiment = expensive_1000_item['average_sentiment'].mean()

print("Average of average_rating in top_1000:", avg_average_rating)
print("Average of average_sentiment in top_1000:", avg_average_sentiment)


avg_average_rating = cheep_1000_items['average_rating'].mean()
avg_average_sentiment = cheep_1000_items['average_sentiment'].mean()

print("Average of average_rating in bottom_1000:", avg_average_rating)
print("Average of average_sentiment in bottom_1000:", avg_average_sentiment)
Average of average_rating in top_1000: 4.189955511311651
Average of average_sentiment in top_1000: 0.5130592409297675
Average of average_rating in bottom_1000: 4.112984430621785
Average of average_sentiment in bottom_1000: 0.47387498297615827

We computed the same using price and reach the same conclusion: higher-priced products do not tend to receive more positive or negative feedback than lower-priced items.

Top and bottom groups¶

In [77]:
rating_groups = {}
sentiment_groups = {}
for group_id, group_df in merged_df.groupby('group'):
    rating_groups[group_id] = group_df['average_rating'].mean()
    sentiment_groups[group_id] = group_df['average_sentiment'].mean()
In [78]:
# Sort the dictionary based on values
sorted_dict = dict(sorted(sentiment_groups.items(), key=lambda item: item[1]))

# Print the bottom 5 elements
print("Bottom 5:")
for key, value in list(sorted_dict.items())[:5]:
    print("the 5 most common words for group",key,Counter(item_groups[key]).most_common(5))
    print(key, ":", value)

# Print the top 5 elements
print("\nTop 5:")
for key, value in list(sorted_dict.items())[-5:]:
    print("the 5 most common words for group",key,Counter(item_groups[key]).most_common(5))
    print(key, ":", value)
Bottom 5:
the 5 most common words for group 22.0 [('dog', 101), ('work', 90), ('spot', 45), ('grass', 44), ('use', 33)]
22.0 : 0.30083530636892436
the 5 most common words for group 41.0 [('bag', 82), ('dog', 31), ('use', 22), ('good', 21), ('like', 20)]
41.0 : 0.3063075255637271
the 5 most common words for group 20.0 [('dog', 2581), ('work', 1929), ('collar', 1815), ('use', 1198), ('bark', 1033)]
20.0 : 0.3488743727609684
the 5 most common words for group 14.0 [('cat', 41), ('claw', 14), ('nail', 14), ('get', 11), ('cap', 11)]
14.0 : 0.3560346153846154
the 5 most common words for group 44.0 [('dog', 26), ('fenc', 23), ('work', 21), ('one', 18), ('bumper', 17)]
44.0 : 0.37356504385964906

Top 5:
the 5 most common words for group 25.0 [('dog', 3), ('love', 2), ('collar', 2), ('last', 2), ('look', 1)]
25.0 : 0.6475500000000001
the 5 most common words for group 35.0 [('collar', 328), ('leash', 212), ('dog', 197), ('love', 161), ('lupin', 154)]
35.0 : 0.6542897007846774
the 5 most common words for group 1.0 [('bag', 10), ('food', 7), ('great', 6), ('dog', 5), ('perfect', 3)]
1.0 : 0.6761033333333334
the 5 most common words for group 17.0 [('cat', 66), ('bed', 49), ('love', 27), ('like', 20), ('one', 15)]
17.0 : 0.7922762195121952
the 5 most common words for group 10.0 [('toy', 5), ('one', 3), ('beak', 2), ('eyebal', 2), ('come', 2)]
10.0 : 0.8119000000000001
In [79]:
# Sort the dictionary based on values
sorted_dict = dict(sorted(rating_groups.items(), key=lambda item: item[1]))

# Print the bottom 5 elements
print("Bottom 5:")
for key, value in list(sorted_dict.items())[:5]:
    print("the 5 most common words for group",key,Counter(item_groups[key]).most_common(5))
    print(key, ":", value)

# Print the top 5 elements
print("\nTop 5:")
for key, value in list(sorted_dict.items())[-5:]:
    print("the 5 most common words for group",key,Counter(item_groups[key]).most_common(5))
    print(key, ":", value)
Bottom 5:
the 5 most common words for group 22.0 [('dog', 101), ('work', 90), ('spot', 45), ('grass', 44), ('use', 33)]
22.0 : 2.8783080790844764
the 5 most common words for group 4.0 [('help', 2), ('ice', 2), ('walk', 2), ('warm', 2), ('worth', 2)]
4.0 : 3.0
the 5 most common words for group 32.0 [('collar', 9), ('bright', 8), ('dog', 6), ('light', 5), ('batteri', 4)]
32.0 : 3.4711538461538463
the 5 most common words for group 39.0 [('dog', 11), ('toy', 9), ('love', 5), ('chewer', 5), ('product', 5)]
39.0 : 3.525
the 5 most common words for group 14.0 [('cat', 41), ('claw', 14), ('nail', 14), ('get', 11), ('cap', 11)]
14.0 : 3.5679487179487177

Top 5:
the 5 most common words for group 35.0 [('collar', 328), ('leash', 212), ('dog', 197), ('love', 161), ('lupin', 154)]
35.0 : 4.825128272558179
the 5 most common words for group 17.0 [('cat', 66), ('bed', 49), ('love', 27), ('like', 20), ('one', 15)]
17.0 : 4.926829268292683
the 5 most common words for group 1.0 [('bag', 10), ('food', 7), ('great', 6), ('dog', 5), ('perfect', 3)]
1.0 : 5.0
the 5 most common words for group 10.0 [('toy', 5), ('one', 3), ('beak', 2), ('eyebal', 2), ('come', 2)]
10.0 : 5.0
the 5 most common words for group 25.0 [('dog', 3), ('love', 2), ('collar', 2), ('last', 2), ('look', 1)]
25.0 : 5.0

Observing the top and bottom groups and their most common words, we can't see any clear trend between the type of product and the sentiment or rating; many of the top groups mention dogs, but so do the bottom groups.

In [80]:
nan_group_df = merged_df[merged_df['group'].isna()]
In [81]:
nan_group_df['average_rating'].mean()
Out[81]:
4.102630306984782
In [82]:
nan_group_df['average_sentiment'].mean()
Out[82]:
0.4927189153512092

We checked whether the items not in a group have abnormal rating or sentiment values, but they have values similar to the other groups. So whether an item is in the largest component or not has no impact on the sentiment or rating of the product.

Wordcloud for communities¶

To get a clearer idea of what the communities are based on, we plot the word clouds of the 9 communities with the most members.

In [83]:
top_9_groups = group_counter.most_common(9)

# Extract top 9 group ids
top_9_group_ids = [group[0] for group in top_9_groups]

# Filter merged_df to include only the top 9 groups
df_filtered_top_9 = merged_df[merged_df['group'].isin(top_9_group_ids)]

# Create a 3x3 grid of subplots
fig, axes = plt.subplots(3, 3, figsize=(15, 15))

# Iterate over each group and corresponding axis
for (group_id, group_df), ax in zip(df_filtered_top_9.groupby('group'), axes.flatten()):
    # Flatten and concatenate tokens for the group
    group_tokens = ' '.join(chain.from_iterable(group_df['review_tokens']))
    
    # Generate word cloud for the group
    wordcloud = WordCloud(width=400, height=400, background_color='white').generate(group_tokens)
    
    # Display the word cloud on the corresponding subplot
    ax.imshow(wordcloud, interpolation='bilinear')
    ax.set_title('Word Cloud for Group {}'.format(int(group_id)))
    ax.axis('off')

# Adjust layout
plt.tight_layout()
plt.show()

In the communities with the most members we see that many are about dogs, but within this there are different categories, such as one group that is mostly about grooming and another about dog treats/food. Because we are using the reviews, some words are not that informative, such as like, love etc.; they don't help us much in finding the category of the community, but instead carry sentiment.

Discussion¶

Reflecting on our work, we had some wins and some spots where we could do better in our analysis. One good thing was that we managed to find important insights: the main questions we wanted answered could be concluded to some extent. There are of course always more ways to filter and look at the data, but as it stands we got a clear understanding of what influences the co-purchase patterns.

However, we spotted some gaps we need to fill. One is in the sentiment analysis. We noticed that sometimes the sentiment didn't match the ratings. This happened because we didn't handle negations like "not good" or "didn't destroy" properly (stopword removal strips the negation before the sentiment is computed). So some comments were misinterpreted, which might have skewed our analysis. Fixing this would improve our understanding of customer feelings and make our findings more accurate.
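A possible fix, sketched below as a suggestion rather than something we ran, is to let VADER score the raw reviewText instead of the stemmed tokens, so that its built-in negation handling is preserved:

def get_sentiment_raw(text):
    # Score the untokenized review so negations such as "not" are still present
    if isinstance(text, str):
        return analyzer.polarity_scores(text)['compound']
    return 0.0

# df_review['sentiment_raw'] = df_review['reviewText'].apply(get_sentiment_raw)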

Also, it is important to use all the available data. We worked with what our computers could handle, but including all the data could give us deeper insights and a fuller picture. This way, our conclusions would be based on a better understanding of the data, making our analysis stronger.

Looking ahead, there are steps we can take to make our analysis even better. For example, focusing on higher-core reviews (products with more than ten reviews) would help balance out any unusual reviews and give us a more accurate picture. Also, a 2023 version of the dataset exists; using newer data would keep our findings up to date and relevant.

We found that certain product categories appear in both the top and bottom sentiment rankings. Combining these similar categories into one group could provide a clearer understanding of their influence on sentiment and ratings.

To sum up, our analysis gave us useful insights, but there's room for improvement. By fixing these areas and making the suggested changes, we can make our analysis stronger and learn more about co-purchase patterns and related topics.

Conclusion¶

In our investigation, we aimed to uncover the factors influencing co-purchasing habits among pet supply customers and how these factors impact their product reviews. Additionally, we sought to determine whether price influences these reviews.

Through our analysis, we discovered that customers tend to co-purchase items primarily within the same animal species and product category, such as food or grooming products. While certain communities exhibit a more positive sentiment, some communities within the same category also appear in the less favorable rankings. Dogs are prevalent in both the top and bottom five groups, making it challenging to conclude whether a specific animal or category consistently leads to better product reviews. Additionally, our findings revealed that price does not significantly affect reviews.